Abstract

Probability and statistics substantially underlie scientific and technological research. In this work we provide an introduction to some of the most important basic statistical topics, from the concept of random experiment to parametric and non-parametric estimation, also including statistical moments as well as random variable transformations. Special attention is given to the adoption of the Dirac delta ‘function’ as a means of achieving a unified modeling approach to discrete and continuous random variables and probability distributions.


‘La semplicità è la sofisticazione finale.’ (‘Simplicity is the ultimate sophistication.’)

Leonardo da Vinci. 

1 Introduction 

Will it rain tomorrow? Given that it is currently the dry season where I live, it is unlikely, but not impossible. We live in a world of enduring uncertainties, and we can only be certain that nothing is absolutely certain or precisely predictable.

Over millennia, humans have had to cope with uncertainties regarding future events. Probability and statistical theory are among the most successful approaches that can help us to understand and handle uncertainties. Starting with analyses of games of chance, the field of probability and statistics found its way into virtually every scientific and technological area, from physics to biology. Currently, probability and statistics play a special role with respect to artificial intelligence, pattern recognition and deep learning.

Despite their vast range of applications and general importance, the basic statistical concepts and methods are relatively simple and accessible. In the current work, we present an introductory and relatively informal first look at some of the most important related topics.

We treat both discrete and continuous random variables and probability distributions in a unified manner, thanks to the adoption of the Dirac delta ‘function’. Some important continuous probability distributions are presented and discussed, including the uniform, constant and normal distributions. Moment characterization of random variables is also briefly addressed, including central moments. The possibility of transforming one random variable into another, and how to obtain the probability distribution of the latter in some cases, is also presented, as well as the important topics of parametric and non-parametric estimation.

2 Random Experiments, Outcomes, and Events

One important concept in statistics is that of a random experiment of interest, such as throwing a dice, or trying to predict the weather. More formally, a random experiment is one whose outcome is uncertain. For generality’s sake, experiments with known outcome are also understood as random experiments, so that virtually every experiment can be treated as being random.

It is essential to specify the random experiment of interest as precisely as possible. For instance, in the case of the above example of throwing a dice, we need to define which dice will be used (its shape, weight, size, hardness, etc.), when and where the experiment will be performed, how the dice will be thrown (and this involves many mechanical aspects such as angle, force, height, etc.), as well as every other aspect that can influence the experiment outcome, such as air resistance, surface friction, and so on.

Given that any event in the real world is potentially influenced by an infinite range of effects (e.g. even the gravitation of the moon can have some influence, tiny as it may be, on the dice outcome), we conclude that random experiments cannot be fully specified, implying some level of error and also that unexpected or biased results can be obtained as a consequence of the incomplete description. In addition, we also have experimental errors and noise typically affecting the measurements.

This situation is completely analogous to that found in scientific modeling, in which the validity and predictive power of any model are limited by not being able to incorporate every effect potentially influencing the phenomenon of interest. The best one can hope for is to obtain a specification of the random experiment of interest that is as complete as possible, as well as to control its environment.

Specifying the random experiment is so important because it allows us to identify the set $\Omega$ of all possible outcomes, also called the universe set. For instance, in the case of throwing a dice and observing the result, we have $\Omega = \{1, 2, 3, 4, 5, 6\}$.

It is now possible to define an event as any subset of $\Omega$. Recall that the empty set $\emptyset$ is always a subset of any set, including $\Omega$. The subset $A_0 = \emptyset$ is called the impossible event. Other possible events in the case of the dice example are $A_1 = \{1\}$, $A_2 = \{1, 3\}$, $A_3 = \{2, 4, 6\}$, and $A_4 = \{1, 2, 3, 4, 5, 6\} = \Omega$. The latter is called the certain event, as it necessarily contains one of the possible outcomes of the random experiment.

It is very important to keep in mind that event and outcome are not equal. For instance, in the case of the dice, we have events that are sets of outcomes, such as $\{1, 3, 5\}$. The outcomes in the dice example are only the numbers 1, 2, 3, 4, 5, and 6. Moreover, events are necessarily sets (by being subsets), while outcomes are individual observations (e.g. the number on the facets of a dice). So, in brief:

$$\text{Event} \neq \text{outcome}.$$

As events are sets, given two events $A$ and $B$, it is possible to consider their intersection $A \cap B$, union $A \cup B$, complementation $A^c = \Omega - A$, etc. When $A \cap B = \emptyset$, events $A$ and $B$ are said to be mutually exclusive.
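Since events are plain sets, the operations above map directly onto a set data type. The following is a minimal sketch for the dice example using Python's built-in `frozenset`; the variable names (`omega`, `A`, `B`) and the particular events chosen are illustrative assumptions.

```python
# Universe set (all possible outcomes) of the dice experiment.
omega = frozenset({1, 2, 3, 4, 5, 6})

# Two example events, i.e. subsets of omega.
A = frozenset({1, 3})
B = frozenset({2, 4, 6})

intersection = A & B        # A intersected with B
union = A | B               # A united with B
complement_A = omega - A    # complement of A with respect to omega

# A and B are mutually exclusive when their intersection is empty.
mutually_exclusive = len(intersection) == 0

print(sorted(union), sorted(complement_A), mutually_exclusive)
```

Note that an individual outcome such as `1` is an integer here, while the corresponding event `frozenset({1})` is a set, which mirrors the distinction between outcomes and events.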

3 The Concept of Probability 

As defined, events provide a relatively precise specification of any possible result we could be interested in regarding a random experiment. Observe that events include sets containing the individual experiment outcomes, but are not limited to them, as they can refer to combinations of outcomes. For instance, in the case of dice throwing, the composite event $\{1, 2\}$ is understood as obtaining 1 or 2 as result.

Given that we can specify situations of particular interest as events, it is now important to associate probabilities to them. This can be done in more than one way, including theoretical and experimental ways, as we will discuss soon. First, let’s formalize probability as a function $P()$ acting on an event $A$ and producing a real value $P(A)$ in the interval $[0, 1]$ as a result, i.e.

$$P : A \mapsto P(A) \in [0, 1] \subset \mathbb{R} \tag{1}$$

We necessarily have that

$$P(\Omega) = 1 \tag{2}$$

$$P(\emptyset) = 0 \tag{3}$$

$$P(A^c) = 1 - P(A) \tag{4}$$

$$P(A \cup B) = P(A) + P(B) - P(A \cap B) \tag{5}$$

When $P(A \cap B) = P(A)P(B)$, the two events $A$ and $B$ are said to be independent.

Theoretical definitions of the probability of an event typically refer to an abstract random experiment. For instance, we can consider a hypothetical random experiment involving a perfectly symmetric dice. In this case, it follows by symmetry that all outcomes have the same probability, i.e. are equiprobable, implying $P(\{1\}) = P(\{2\}) = P(\{3\}) = P(\{4\}) = P(\{5\}) = P(\{6\}) = p$. Observe that, formally speaking, we cannot write $P(1)$, using $P(\{1\})$ instead, since ‘1’ is not an event (which is necessarily a set), but an individual outcome.

Now, we need to determine $p$. This can be achieved by considering that the events in question are mutually exclusive, which allows us to apply Equation 5 with empty intersections:

$$P(\{1, 2, 3, 4, 5, 6\}) = P(\{1\} \cup \{2\} \cup \{3\} \cup \{4\} \cup \{5\} \cup \{6\}) = P(\{1\}) + P(\{2\}) + P(\{3\}) + P(\{4\}) + P(\{5\}) + P(\{6\}) = 6p = 1,$$

so that $p = 1/6$.
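The properties in Equations 2 to 5, as well as the independence condition, can be checked exactly for the perfectly symmetric dice. The sketch below assumes the equiprobable model ($p = 1/6$) and uses exact rational arithmetic; the helper `prob` and the two events `A` and `B` are hypothetical choices made only for illustration.

```python
from fractions import Fraction

omega = frozenset(range(1, 7))

def prob(event):
    """Probability of an event under the equiprobable model (assumed)."""
    return Fraction(len(event & omega), len(omega))

A = frozenset({1, 3, 5})   # an odd number is obtained
B = frozenset({1, 2})      # 1 or 2 is obtained

# Equations 2 to 5.
assert prob(omega) == 1
assert prob(frozenset()) == 0
assert prob(omega - A) == 1 - prob(A)
assert prob(A | B) == prob(A) + prob(B) - prob(A & B)

# Independence: P(A and B) equals P(A) times P(B).
independent = prob(A & B) == prob(A) * prob(B)

print(prob(A), prob(B), prob(A & B), independent)
```

For these particular events the product rule holds, so they are independent even though they share the outcome 1; independence and mutual exclusivity are different notions.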

However, when dealing with real-world situations, symmetry is hardly ever perfect and we cannot have fully comprehensive information about the random phenomenon of interest. Therefore, it is necessary to resort to performing the random experiment many times while recording the outcomes.

For instance, let’s suppose that we have a specific dice to be thrown in specific situations, and the events corresponding to observing each of the six dice facets. Assume that the experiment was performed $N = 1\,000\,000$ times, and the results were as shown in the following table.

Table 1: Counts of obtaining each of the faces of a dice as an event, and respectively estimated probabilities.

event           | count  | probability P
----------------|--------|--------------
$A_1 = \{1\}$   | 167938 | 0.167938
$A_2 = \{2\}$   | 170680 | 0.170680
$A_3 = \{3\}$   | 173784 | 0.173784
$A_4 = \{4\}$   | 173761 | 0.173761
$A_5 = \{5\}$   | 167582 | 0.167582
$A_6 = \{6\}$   | 146255 | 0.146255

The probability of each of these events $A_i$ can be estimated as the ratio between the number of occurrences of $A_i$ and the total number of random experiments $N$, i.e.

$$P(A_i) = \frac{\#(A_i)}{N} \tag{6}$$

where $\#(A_i)$ denotes the number of times that event $A_i$ was observed among the $N$ experiments.

We observe that these outcomes are unlikely to be equiprobable, as the estimated probabilities present relatively large deviations from the expected value of $1/6$ after one million experiments.
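The estimates in Table 1 follow directly from Equation 6. The short sketch below recomputes them from the counts in the table and lists the deviation of each estimate from $1/6$; the variable names are illustrative.

```python
# Counts of each face, as listed in Table 1.
counts = {1: 167938, 2: 170680, 3: 173784, 4: 173761, 5: 167582, 6: 146255}

# Total number of random experiments.
N = sum(counts.values())

# Equation 6: estimated probability = number of occurrences / N.
estimates = {face: c / N for face, c in counts.items()}

# Deviation of each estimate from the equiprobable value 1/6.
deviations = {face: p - 1 / 6 for face, p in estimates.items()}

for face in sorted(counts):
    print(face, f"{estimates[face]:.6f}", f"{deviations[face]:+.6f}")
```

The counts add up to one million, and face 6 shows the largest departure from $1/6$.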

The probability to be assigned to real world events is formally defined as an extension of the above equation as the number of random experiments tends to infinity, i.e.

$$P(A_i) = \lim_{N \to \infty} \frac{\#(A_i)}{N} \tag{7}$$


Interestingly, the law of large numbers ensures that, provided a random experiment preserves its characteristics along time (a stationary experiment), the probabilities estimated by using Equation 7 will converge to their actual values as we approach an infinite number of samples.
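This convergence can be illustrated with a simulated dice. The sketch below assumes an ideal, stationary, equiprobable dice emulated by Python's pseudo-random generator, so it only illustrates the trend suggested by Equation 7; the function name, seed and sample sizes are arbitrary choices.

```python
import random

def estimate_probability(face, n_throws, rng):
    """Relative frequency of `face` in n_throws simulated dice throws."""
    hits = sum(1 for _ in range(n_throws) if rng.randint(1, 6) == face)
    return hits / n_throws

rng = random.Random(12345)  # fixed seed, for reproducibility only

# The estimate tends to approach 1/6 as the number of throws grows.
for n in (100, 10_000, 100_000):
    estimate = estimate_probability(6, n, rng)
    print(n, round(estimate, 4), round(abs(estimate - 1 / 6), 4))
```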

However, full convergence cannot be obtained in practice, as it is impossible to perform an infinite number of experiments, and also because the experiment conditions are not completely specified and/or tend to change along time. For instance, the dice can get worn, the air density may change, etc. That is why it is so important to keep the environmental conditions as stable as possible during the experiments.


4 Random Variables and Probability Distributions

Though we described a means of assigning probabilities to events corresponding directly to specific outcomes of a random experiment with discrete outcomes, it would be very convenient to incorporate the continuous measurements often found in practice into the framework developed so far. Examples of these include the weight of an apple and the outside temperature today.

In probability and statistics, these measurements are called random variables. Keep in mind that we will henceforth understand random variables as typically corresponding to the outcomes of a specific random experiment, but formally the probability of a random variable refers to the set containing each respective outcome.

Observe that the facets of a dice can already be understood as numeric measurements with discrete values. It is also possible to have discrete outcomes of a categorical nature, such as the city in which a person was born, or color names. In such cases, it is still possible to map the categorical labels into respective numeric values, though this often implies some arbitrariness.

What is necessary now is to extend the concept of probability to continuous outcomes, which can therefore take an infinite number of values. For instance, the weight of an apple is a non-negative real value. All the already mentioned outcomes or measurements are called random variables, which are often identified by capital letters such as $X$. Lowercase letters are typically reserved for expressing the values of the respective random variable.

In order to integrate discrete and continuous random variables into a unified approach to statistical models, we will resort to the Dirac delta ‘function’. Formally speaking, this ‘function’ is not a function, but a functional that maps a function $g(x)$ into its value at zero, i.e. $g(0)$, as studied in a branch of mathematics called distribution theory.

However, for simplicity’s sake, in the current text we will understand the Dirac delta ‘function’ more informally as the limit of some function that has area 1, such as the rectangular function $r(x)$ centered at the origin and having width $a$ and height $1/a$. The Dirac delta function can be understood as the limit of this function as we make $a$ smaller and smaller, i.e.

$$\delta(x) = \lim_{a \to 0} r(x) \tag{8}$$

Though the height of this function increases continuously as its width is decreased, its unitary area is conserved, and we can write

$$\int_{-\infty}^{\infty} \delta(x)\,dx = 1 \tag{9}$$


Observe that the Dirac delta can also be defined as the limit of an infinite number of alternative functions, such as the Gaussian normalized to have unit area.

Informally, we can also write $\delta(x)g(x) = \delta(x)g(0)$, which expresses the ‘sampling’ capability of the Dirac delta function, meaning that

$$\int_{-\infty}^{\infty} \delta(x)g(x)\,dx = g(0) \tag{10}$$
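The informal limit view of the Dirac delta can be probed numerically. The sketch below approximates the delta by the rectangular pulse of width $a$ and height $1/a$ and checks that its area stays at 1 while the integral of the pulse times a smooth function $g$ approaches $g(0)$ as $a$ decreases. The choice $g(x) = \cos(x)$, the midpoint-rule integrator and all function names are assumptions made only for this illustration.

```python
import math

def rect(x, a):
    """Rectangular pulse centered at 0, width a and height 1/a (unit area)."""
    return 1.0 / a if abs(x) <= a / 2 else 0.0

def integrate(f, lo, hi, steps=100_000):
    """Midpoint-rule numerical integration (illustrative helper)."""
    h = (hi - lo) / steps
    return sum(f(lo + (k + 0.5) * h) for k in range(steps)) * h

def pulse_area(a):
    """Area of the pulse, which should remain equal to 1."""
    return integrate(lambda x: rect(x, a), -1.0, 1.0)

def sampled_value(a, g=math.cos):
    """Integral of pulse times g, which should approach g(0) as a shrinks."""
    return integrate(lambda x: rect(x, a) * g(x), -1.0, 1.0)

for a in (1.0, 0.1, 0.01):
    print(a, round(pulse_area(a), 6), round(sampled_value(a), 6))
```

Since $\cos(0) = 1$, the last column should move towards 1 as the pulse narrows, which is the behaviour expressed by Equation 10.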


Interestingly, we can now use the Dirac delta function to represent the six probabilities associated to the dice example as a function given by the sum of the 6 respective Dirac deltas, i.e.

$$p_\delta(x) = \sum_{i=1}^{6} \frac{1}{6}\,\delta(x - i) \tag{11}$$

which is shown in Figure 1.

Figure 1: The probability distribution respective to the dice experiment as represented by a sum of Dirac delta ‘functions’ placed at the respective discrete outcome values.


The area of $p_\delta(x)$ can be simply calculated by adding the areas of its six Dirac deltas, each obtained from Equation 9 and weighted by $1/6$, yielding 1 as result. Yet, the value of $p_\delta(x)$ at any of the points 1, 2, 3, 4, 5, and 6 is infinite, so that the probability of any of the respective outcomes needs to be understood as the area of the respective Dirac delta, in this case $1/6$.

We can infer from the above outlined application of the Dirac delta to obtain probability functions that these representations have the nature of a density. Indeed, if we have a density function, e.g. of mass, the quantity of mass within an interval $[a, b]$ is obtained by integrating the density function along this interval. In probability, these functions are typically called probability distributions or densities.

Let’s develop the concept of continuous probability distributions by considering two-dimensional dice with varying number of facets $n$, all with the same size (see Figure 2). The smallest such dice has 3 facets (an equilateral triangle). These dice are thrown within a 2D limited space (e.g. in the interstice between two sheets of glass), and the outcome is understood as corresponding to the facet that ends up in contact with the floor.



Figure 2: Two-dimensional dice with 3 to 6 facets. 


The labels associated to each of the $n$ facets are distributed uniformly between two boundary values $a$ and $b$ (e.g. within the interval $[0, 1)$), so that the facet label associated to the $i$-th facet is $(i - 1)/n$, with $i = 1, 2, \ldots, n$.

These outcomes are discrete random variables that can be represented by Dirac delta functions of the type

$$p_\delta(x) = \sum_{i=1}^{n} \frac{1}{n}\,\delta\left(x - \frac{i - 1}{n}\right) \tag{12}$$

If we take the limit when $n \to \infty$, we get the continuous probability function $p_\delta(x) = 1$ for $x \in [0, 1)$, which can be verified to be a probability distribution.

Formally speaking, any function $p(x)$ obeying the following conditions can be a candidate for being the probability distribution modeling the outcome of some random experiment:

$$p(x) \geq 0, \ \forall x; \tag{13}$$

$$\int_{-\infty}^{\infty} p(x)\,dx = 1 \tag{14}$$

and we have that $p_\delta(x)$ above is indeed a probability distribution.

Given a probability distribution $p(x)$, we cannot understand the value of this function at a specific point, e.g. $p(2)$, as a probability. Instead, this value corresponds to the density of probability of the outcome at that specific point. So, the probability that the outcome lies within an interval $[a, b]$ can be immediately calculated as the area of $p(x)$ along this interval, i.e.

$$P(a \leq X \leq b) = \int_{a}^{b} p(x)\,dx \tag{15}$$

Fortunately, this result also holds for discrete probability distributions represented in terms of sums of Dirac delta ‘functions’. For instance, for the dice distribution in Equation 11 we have

$$P(2 \leq X \leq 4) = \int_{1.5}^{4.5} p_\delta(x)\,dx = \frac{3}{6} = \frac{1}{2}. \tag{16}$$

All in all, we have that the adoption of the Dirac delta ‘function’ to represent the probability of numeric outcomes turns out to allow a unified and convenient approach capable of handling both discrete and continuous random variables in terms of ‘continuous’ probability distribution functions potentially involving Dirac deltas.
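A distribution made of Dirac deltas can be stored simply as a list of (position, area) pairs, in which case integrating over an interval reduces to adding the areas of the deltas that fall inside it. The sketch below applies this idea to the six-facet dice and to the $n$-facet two-dimensional dice with labels $(i - 1)/n$; the representation and the function names are illustrative assumptions rather than constructs taken from the text.

```python
from fractions import Fraction

def interval_probability(deltas, a, b):
    """P(a <= X <= b): total area of the deltas located inside [a, b]."""
    return sum((area for pos, area in deltas if a <= pos <= b), Fraction(0))

# Six-facet dice: deltas of area 1/6 at positions 1 to 6.
dice_deltas = [(Fraction(i), Fraction(1, 6)) for i in range(1, 7)]
p_2_to_4 = interval_probability(dice_deltas, Fraction(3, 2), Fraction(9, 2))
print(p_2_to_4)

def facet_deltas(n):
    """n-facet dice: deltas of area 1/n at positions (i - 1)/n, i = 1..n."""
    return [(Fraction(i - 1, n), Fraction(1, n)) for i in range(1, n + 1)]

# As n grows, P(1/4 <= X <= 3/4) approaches the interval length 1/2,
# as expected for the uniform density p(x) = 1 on [0, 1).
for n in (4, 100, 10_000):
    p = interval_probability(facet_deltas(n), Fraction(1, 4), Fraction(3, 4))
    print(n, float(p))
```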

It is interesting to observe at this point that, though most real world events are probabilistic, it turns out that it is particularly challenging to produce perfectly uniform random variables. Indeed, much research has been done in devising effective methods for random number generation. So, it is somewhat surprising to experience such difficulties in a universe permeated by probability.
Probability distributions correspond to a key concept in probability and statistics, especially because they provide effective models of aspects (i.e. those quantified by the considered random variables) of a respective random experiment. Indeed, the probability function associated to a specific random variable $X$ can be understood as providing as much information about the behavior of $X$ as it would be possible to obtain.

It is also interesting to consider cumulative distributions, which are defined from the respective probability distribution functions $p(x)$ as follows

$$P(X \leq x) = \int_{-\infty}^{x} p(X)\,dX \tag{17}$$

So, cumulative distributions accumulate the probability along the $X$ values, providing an indication of the overall probability of a specific observation being comprised in the interval extending from $-\infty$ to $x$.

It follows immediately from Equation 17 that

$$p(x) = \frac{dP(X \leq x)}{dx}. \tag{18}$$


5 Statistical Moments 


We have already seen that, given a random variable $X$, it is possible to model it in terms of its respective probability distribution $p(X)$. Because this function is so important for the characterization of the variable $X$, it is often interesting to derive some properties from it. A particularly interesting way to do so is through functionals, i.e. mappings from the function $p(X)$ to real values, such as the area (which yields 1). The expectance $E[X]$ of a random variable $X$ is such a functional, being defined as

$$E[X] = \int_{-\infty}^{\infty} X\,p(X)\,dX \tag{19}$$


For instance, in the case of the uniform distribution $p(X) = 1$ in the interval $[0, 1)$, we have

$$E[X] = \int_{0}^{1} X\,p(X)\,dX = \frac{1}{2}. \tag{20}$$


The expectance of $X$ coincides with the concept of the mean or average of that random variable. $E[X]$ is also known as the first moment of $p(X)$. Observe that expectance refers to a variable, while moment refers to the probability distribution function.

It is interesting to observe that the interpretation of Equation 19 as the average (or mean) of the respective random variable $X$ can be easily verified as follows. Let’s go back to the dice experiment and assume that we obtained the number of observations of each possible outcome shown in Table 2.

Observe that this application of the expectance Equation 19 corresponds precisely to what one would typically do when calculating the average of the set of obtained random variable values: adding all obtained values (i.e. 69) and dividing the result by the total number of experiments (i.e. 20).
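The equivalence between the two ways of computing the mean can be checked directly from the counts of the hypothetical dice experiment of Table 2 (3, 4, 3, 3, 5 and 2 observations of faces 1 to 6). The sketch below uses exact fractions; the variable names are illustrative.

```python
from fractions import Fraction

# Number of observations of each face (Table 2).
counts = {1: 3, 2: 4, 3: 3, 4: 3, 5: 5, 6: 2}
N = sum(counts.values())

# Plain average: add all observed values and divide by N.
total = sum(x * c for x, c in counts.items())
plain_mean = Fraction(total, N)

# Expectance: sum of x times P(x), with P(x) estimated as count / N.
expectance = sum(x * Fraction(c, N) for x, c in counts.items())

print(N, total, plain_mean, float(expectance))
```

Both routes give $69/20 = 3.45$, because dividing each count by $N$ before or after the summation leads to the same value.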

It is possible to extend the concept of moments to higher orders, also known as the $k$-th moments of $p(X)$:

$$E[X^k] = \int_{-\infty}^{\infty} X^k\,p(X)\,dX \tag{21}$$

By considering the random variable $X$ as being transformed into another random variable $\tilde{X} = X - \mu$, where $\mu = E[X]$ is the average of $X$, we can define the $k$-th central moments of $X$ as

$$E[(X - \mu)^k] = \int_{-\infty}^{\infty} (X - \mu)^k\,p(X)\,dX \tag{22}$$

The second central moment corresponds to the frequently used concept of variance $\sigma^2 = E[(X - \mu)^2]$, while its positive square root corresponds to the standard deviation $\sigma = +\sqrt{\sigma^2}$ of the random variable of interest $X$. Both these statistical measurements quantify the dispersion of $X$ around its average, though with different units.

It is also possible to define non-dimensional quantifications of dispersion, such as the coefficient of variation of a random variable $X$, defined as $c_v = \sigma/\mu$.
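As a worked illustration of Equations 19 to 22, the moments of the uniform distribution $p(X) = 1$ on $[0, 1)$ can be approximated by numerical integration. The midpoint-rule integrator and the function names below are assumptions made for this sketch; the analytical values for this distribution are $\mu = 1/2$ and $\sigma^2 = 1/12$.

```python
def integrate(f, lo, hi, steps=100_000):
    """Midpoint-rule numerical integration (illustrative helper)."""
    h = (hi - lo) / steps
    return sum(f(lo + (k + 0.5) * h) for k in range(steps)) * h

def moment(p, k, lo, hi):
    """k-th moment: integral of x**k * p(x) over [lo, hi]."""
    return integrate(lambda x: x ** k * p(x), lo, hi)

def central_moment(p, k, lo, hi):
    """k-th central moment: integral of (x - mu)**k * p(x) over [lo, hi]."""
    mu = moment(p, 1, lo, hi)
    return integrate(lambda x: (x - mu) ** k * p(x), lo, hi)

def uniform(x):
    """Uniform density on [0, 1); only evaluated inside that interval here."""
    return 1.0

mean = moment(uniform, 1, 0.0, 1.0)               # close to 1/2
variance = central_moment(uniform, 2, 0.0, 1.0)   # close to 1/12
std = variance ** 0.5
cv = std / mean                                   # coefficient of variation

print(round(mean, 6), round(variance, 6), round(std, 6), round(cv, 6))
```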

Interestingly, it can be shown that under certain circumstances (e.g. by using the Carleman theorem), if we take all the possible moments of a probability distribution $p(x)$, this will allow us to recover $p(x)$, meaning that the infinite set of moments can provide as much information about the random variable of interest $X$ as the probability distribution itself. That is so because each successive moment provides some new information about the properties of $p(x)$, until the mapping between moments and $p(x)$ becomes bijective and, therefore, invertible.


6 Some Probability Distributions 

The function $p_\delta$ derived above with respect to the limiting case of a two-dimensional dice with an infinite number of facets represents one of the most important probability distributions, called uniform. More specifically, the uniform distribution can be defined as

$$p(x) = \begin{cases} c = \dfrac{1}{b - a}, & \text{for } x \in [a, b] \\ 0, & \text{otherwise} \end{cases} \tag{23}$$

where $c$ is a real constant value so that the total area of $p(x)$ is 1.

This distribution is particularly important because it describes a situation in which all outcomes within the interval $[a, b]$ have the same probability density. Though this distribution can often be used to model abstract, theoretical situations such as the two-dimensional dice considered above, there are not many natural random variables that can be precisely modeled by it. Indeed, it is difficult to think of natural experiments yielding outcomes that are perfectly uniform in the statistical sense.

Table 2: The mean interpretation of the expectance $E[X]$ of a random variable $X$ corresponding to a hypothetical dice experiment.

event          | count    | (X)(count)  | P(X) = count/N | (X)(P(X))
---------------|----------|-------------|----------------|---------------------
$A_1 = \{1\}$  | 3        | (1)(3) = 3  | 3/20           | (1)(3/20) = 3/20
$A_2 = \{2\}$  | 4        | (2)(4) = 8  | 4/20           | (2)(4/20) = 8/20
$A_3 = \{3\}$  | 3        | (3)(3) = 9  | 3/20           | (3)(3/20) = 9/20
$A_4 = \{4\}$  | 3        | (4)(3) = 12 | 3/20           | (4)(3/20) = 12/20
$A_5 = \{5\}$  | 5        | (5)(5) = 25 | 5/20           | (5)(5/20) = 25/20
$A_6 = \{6\}$  | 2        | (6)(2) = 12 | 2/20           | (6)(2/20) = 12/20
Total:         | N = 20   | 69          | 1              | Mean = 3.45 = 69/20

So, it is not surprising that many other probability distribution functions have been derived and proposed. As a matter of fact, it should be remembered that there is an infinite number of potential probability distributions, as we can always renormalize a large class of functions (e.g. those having finite area) so that they are always non-negative and have unit area. In this section, we review some of the most frequently adopted probability distribution functions, in addition to the already seen uniform distribution.

Another particularly simple and singular example is the constant probability function, defined as

$$p(x) = \delta(x - x_0) \tag{24}$$

This probability distribution is associated to random experiments whose result is a certain event, indicating that the outcome will always be equal to $x_0$.

Other probability distributions are presented in Table 3, together with their respective parameters, as well as the respective mean and standard deviation of $X$. By parameters we mean the variables that can vary from one type of random experiment to another, allowing the adaptation of the model to the specific circumstances, but which remain fixed along different realizations of the same experiment.

Of particular importance is the normal distribution, characterized by its symmetric bell shape. This distribution is particularly important because of the central limit theorem, which states that, under certain circumstances, sums of independent random variables tend to present this type of distribution.

Observe that the normal distribution involves two parameters: the average $\mu$ and standard deviation $\sigma$ of the respective random variable $X$. This is one example of a probability distribution having statistical moments as parameters, but this is not always the case for other distributions.

Table 4 provides the percentage of probability comprised in the intervals with length 2 to 6 standard deviations centered at the average of any normal distribution.

Observe that the interval of $X$ having length equal to 4 standard deviations will comprise a very significant percentage (i.e. 95.45%) of the possible outcomes of $X$, while 6 standard deviations will incorporate almost every possible result. Yet, it is important to keep in mind that the normal distribution tends asymptotically to zero at both its left and right extremities, reaching null value only at $-\infty$ and $\infty$, respectively.
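The percentages quoted above can be reproduced with the error function, using the standard identity $P(\mu - k\sigma \leq X \leq \mu + k\sigma) = \mathrm{erf}(k/\sqrt{2})$ for a normal random variable. This identity is not derived in the present text and is used here as an outside fact; the function name is an illustrative choice.

```python
import math

def prob_within(k):
    """Probability that a normal variable lies within k standard deviations
    of its average, i.e. in an interval of total length 2k sigma."""
    return math.erf(k / math.sqrt(2.0))

for k in (1, 2, 3):
    print(f"interval length {2 * k} sigma: {100 * prob_within(k):.2f}%")
```

The result does not depend on the particular values of $\mu$ and $\sigma$, which is why a single table applies to any normal distribution.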

The log-normal is another probability distribution, which is associated to the normal but considers only positive values of the random variable $X$, being a potential choice for modeling some random experiments producing this type of outcome.

7 Random Variable Transformations

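Given a random variable, or measurement, $X$, we can be interested in characterizing the statistical properties of a respectively derived new variable, such as $Y = X^2$ or $Y = \log(X)$, provided some circumstances are met (e.g. $x > 0$ in the latter example). Such modifications of random variables are called random variable transformations. Since every measurement from the real world is a random variable, any scientific law, equation or relationship involves random variable transformations.

Table 3: Some frequently used and/or interesting probability distribution functions and their respective parameters, mean and standard deviation.

Distribution | $p(x)$ | parameters | mean ($\mu$) | st. dev. ($\sigma$)
-------------|--------|------------|--------------|--------------------
Uniform | $c$ for $x \in [a, b]$; 0 otherwise | $a, b$ | $\frac{a + b}{2}$ | $\sqrt{\frac{1}{12}(b - a)^2}$
Constant | $\delta(x - x_0)$ | $x_0$ | $x_0$ | 0
Exponential | $\lambda e^{-\lambda x}$ | $\lambda > 0$ | $\frac{1}{\lambda}$ | $\frac{1}{\lambda}$
Normal | $\frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2}$ | $\mu, \sigma$ | $\mu$ | $\sigma$
Log-Normal | $\frac{1}{x\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{\ln(x) - \mu}{\sigma}\right)^2}$ | $\mu, \sigma$ | $e^{\mu + \frac{\sigma^2}{2}}$ | $\sqrt{(e^{\sigma^2} - 1)\,e^{2\mu + \sigma^2}}$
Logit | $\frac{1}{\sigma\sqrt{2\pi}} \frac{1}{x(1 - x)} e^{-\frac{\left(\log\left(\frac{x}{1 - x}\right) - \mu\right)^2}{2\sigma^2}}$ | $\mu, \sigma$ | - | -
Student’s t | $\frac{\Gamma\left(\frac{\nu + 1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\left(\frac{\nu}{2}\right)} \left(1 + \frac{x^2}{\nu}\right)^{-\frac{\nu + 1}{2}}$ | $\nu > 0$ | 0 for $\nu > 1$ | $\sqrt{\frac{\nu}{\nu - 2}}$ for $\nu > 2$

Table 4: The probability comprised in the interval having length of 2 to 6 standard deviations centered at the average of any normal distribution.

interval | probability
---------|------------
$P(\mu - \sigma \leq X \leq \mu + \sigma)$ | 68.27%
$P(\mu - 2\sigma \leq X \leq \mu + 2\sigma)$ | 95.45%
$P(\mu - 3\sigma \leq X \leq \mu + 3\sigma)$ | 99.73%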

Let’s consider the problem of deriving the probability 
distribution of a new random variable Y derived from 
X. We shall consider the so-called distribution function 
technique , which is presented in terms of the following 
example. 
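As a numerical companion to the example worked out in this section, in which $X$ is uniform on $[1, 4]$ and $Y = X^2$, the sketch below compares the cumulative distribution $P(Y \leq y) = (\sqrt{y} - 1)/3$ and the density $p(y) = 1/(6\sqrt{y})$, both obtained by the distribution function technique, against a simulation. The function names, the sample size and the random seed are illustrative assumptions.

```python
import math
import random

def cdf_y(y):
    """Analytical P(Y <= y) for Y = X**2 with X uniform on [1, 4]."""
    if y < 1.0:
        return 0.0
    if y > 16.0:
        return 1.0
    return (math.sqrt(y) - 1.0) / 3.0

def pdf_y(y):
    """Analytical density of Y: derivative of cdf_y on [1, 16]."""
    return 1.0 / (6.0 * math.sqrt(y)) if 1.0 <= y <= 16.0 else 0.0

# Simulate X, transform it, and compare empirical and analytical CDFs.
rng = random.Random(0)
samples = [rng.uniform(1.0, 4.0) ** 2 for _ in range(200_000)]
for y in (2.0, 4.0, 9.0):
    empirical = sum(s <= y for s in samples) / len(samples)
    print(y, round(empirical, 3), round(cdf_y(y), 3))

# The density should have unit area over [1, 16] (midpoint rule).
steps = 100_000
h = 15.0 / steps
area = sum(pdf_y(1.0 + (k + 0.5) * h) for k in range(steps)) * h
print(round(area, 6))
```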

Let $X$ be a random variable characterized by the respective distribution function

$$p(x) = \begin{cases} \dfrac{1}{3}, & \text{for } 1 \leq x \leq 4 \\ 0, & \text{otherwise} \end{cases} \tag{25}$$

Let $Y = f(X) = X^2$ be a new random variable of interest. We can make

$$P(Y \leq y) = P(X^2 \leq y) = P(X \leq y^{1/2}) = \int_{-\infty}^{y^{1/2}} p(x)\,dx = \int_{1}^{y^{1/2}} \frac{1}{3}\,dx = \frac{1}{3}x\,\Big|_{1}^{y^{1/2}} = \frac{1}{3}\left(\sqrt{y} - 1\right) \tag{26}$$

so as to obtain the cumulative distribution function $P(Y \leq y)$ of the new random variable $Y$. Now, in order to derive the density distribution $p(y)$, all we have to do is to differentiate $P(Y \leq y)$ with respect to $y$, i.e.

$$p(y) = \frac{dP(Y \leq y)}{dy} = \frac{1}{6}\frac{\sqrt{y}}{y}, \quad 1 \leq y \leq 16 \tag{27}$$

which can be rewritten as $p(y) = \frac{1}{6\sqrt{y}}$, $1 \leq y \leq 16$.

As expected, it can be verified that $\int_{-\infty}^{\infty} p(y)\,dy = 1$. Observe that this particular technique requires that the random variable transformation is one-to-one (or injective) and that $f(X)$ strictly increases along the interval of interest. An analogous method can be used for one-to-one and strictly decreasing random variable transformations.


8 Parametric and Non-parametric Estimation


So far, we assumed that the probability distribution functions of a given random variable were somehow known, providing respective models. In some cases it is possible to derive the distribution theoretically through mathematical developments, as is the case with inferring that the electrons constituting the beam of a cathode ray tube follow a normal distribution. This is achieved by taking into account that the electrons perform a random walk as they move along the beam, and this type of stochastic dynamics can be shown to lead to a normal distribution. However, in practice the distribution functions are often unknown at first, and have to be estimated.
In this section we will briefly discuss two of the main approaches for estimating the probability distribution of a random variable: (i) parametric; and (ii) non-parametric.

In the former approach, the random variable of interest is known, or assumed, to have a specific type of probability distribution, such as uniform, normal, etc. So, the estimation of the statistical model involves inferring only the involved parameters.

This can be performed according to estimating equations derived for each of the involved parameters. These equations are obtained by imposing optimal statistical adherence to the data, in approaches such as the maximum likelihood method (e.g. [12]). These estimators can be obtained from the literature with respect to a large range of distribution functions.

Let’s illustrate parametric estimation with respect to the statistical modeling of a random variable sampled from a normal distribution with average $\mu = 3$ and standard deviation $\sigma = \sqrt{2}$. For the purpose of our example and discussion, we assume that all that is known is the set of 8 respective observations $S = \{1.88; 2.54; 6.12; 3.14; 3.26; 6.43; 3.92; 0.47\}$. These values are illustrated in Figure 3(a).

The average and standard deviation of the values in $S$ can be estimated as

$$\hat{\mu} = \frac{1}{N}\sum_{i=1}^{N} X_i \tag{28}$$

$$\hat{\sigma}^2 = \frac{1}{N - 1}\sum_{i=1}^{N} (X_i - \hat{\mu})^2 \tag{29}$$

where $N$ is the number of samples, yielding $\hat{\mu} = 3.4697$ and $\hat{\sigma} = 2.0187$, which can be verified to be similar to the respective original values.
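The two estimators can be applied to the 8 listed observations with Python's standard `statistics` module, whose `stdev` function divides by $N - 1$ as in Equation 29. The values in the set are given with two decimals, so the results of this sketch agree with the figures quoted in the text only up to small differences in the last digits, presumably because the listed observations are rounded.

```python
import statistics

# The 8 observations listed in the text.
S = [1.88, 2.54, 6.12, 3.14, 3.26, 6.43, 3.92, 0.47]

mu_hat = statistics.mean(S)       # Equation 28
sigma_hat = statistics.stdev(S)   # Equation 29 (divides by N - 1)

print(round(mu_hat, 2), round(sigma_hat, 2))
```

Using `statistics.pstdev` instead would divide by $N$ and give a visibly smaller value for such a small sample.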

The normal distribution so obtained is shown in solid lines in Figure 3(b), together with the normal distribution using the original parameters (dashed lines). It can be verified that this parametric estimation procedure yielded a reasonable statistical model of $X$, especially considering that only 8 samples were available.

The other main possibility for obtaining the probability distribution of a given random variable, namely non-parametric estimation, does not require the assumption of any particular type of a priori distribution function. Instead, the sought function is obtained through some kind of ‘interpolation’ of the relative frequency histogram describing the random variable being modeled.

There are several possibilities for deriving a non-parametric distribution, and here we consider only the smoothing of the relative frequency histogram by convolving it with some smooth kernel (e.g. a normal function).


First, we need to define the relative frequency histogram of a given random variable $X$. We start with a set of $N$ respective observations $S = \{X_1, X_2, \ldots, X_N\}$. Having identified the respective minimum $X_m$ and maximum $X_M$ values, we consider an interval $X \in [a, b]$ so that $a \leq X_m$ and $b \geq X_M$. This interval is divided into $n$ subintervals, or bins, henceforth represented as $b_1, b_2, \ldots, b_n$.

Often, these bins are assumed to have the same size
$\Delta X$. All bins are initially set to zero count. Then,
each of the elements in $S$ is taken and the count of the bin
in which it is contained is incremented. In the end, we
have the histogram $h(i)$ of $X$ for resolution $\Delta X$. If the
values in each of the bins are divided by $N$, we obtain the
respective relative frequency histogram $f_R(i)$.
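
The binning procedure just described can be sketched in a few lines of Python. The function name, the interval [0, 7] and the choice of 7 bins are hypothetical and serve only to illustrate the steps; they are not taken from the text.

```python
S = [1.88, 2.54, 6.12, 3.14, 3.26, 6.43, 3.92, 0.47]

def relative_frequency_histogram(samples, a, b, n):
    """Return bin centers and relative frequencies for n equal bins on [a, b].

    Assumes every sample x satisfies a <= x <= b.
    """
    width = (b - a) / n
    counts = [0] * n                      # all bins start with zero count
    for x in samples:
        # Index of the bin containing x; a value equal to b goes to the last bin.
        i = min(int((x - a) / width), n - 1)
        counts[i] += 1
    centers = [a + (i + 0.5) * width for i in range(n)]
    freqs = [c / len(samples) for c in counts]   # divide counts by N
    return centers, freqs

# Hypothetical interval and resolution: [0, 7] with 7 bins of width 1.
centers, f_R = relative_frequency_histogram(S, a=0.0, b=7.0, n=7)
for c, f in zip(centers, f_R):
    print(f"bin centered at {c:.1f}: relative frequency {f:.3f}")
```

Because each count is divided by N, the relative frequencies add up to 1 regardless of the number of bins chosen.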

Now, we can assign a Dirac delta function at the very
middle $X_i$ of each bin, having area equal to the respective
relative frequency $f_R(i)$. More formally, we can define
the function

$$F_R(x) = \sum_{i=1}^{n} f_R(i)\, \delta(x - X_i) \qquad (30)$$

It can be immediately verified that this function is non-negative
and has area equal to 1, and therefore already
provides a valid candidate distribution function for $X$.
Indeed, it is interesting to observe that the probability
distribution function $p(x)$ of the considered random variable
can be defined in terms of the following limit:

$$p(x) = \lim_{N \to \infty} F_R(x) \qquad (31)$$

However, this would require an infinite number of samples,
which is not available in our hypothetical example.
Indeed, the probability distribution obtained from the 8
sample values has many gaps between the Dirac deltas, as
well as abrupt variations. So, we are interested in smoothing
this function in order to obtain a continuous
interpolation.

This can be achieved by convolving $F_R(x)$ with a normal
function $g_\sigma(x)$ with average 0 and standard deviation $\sigma$,
yielding the approximate probability distribution

$$\tilde{p}(y) = \int_{-\infty}^{\infty} g_\sigma(y - x)\, F_R(x)\, dx \qquad (32)$$

The convolution of a given function $f(x)$ with a sum of
Dirac delta ‘functions’ can be verified (e.g. [14]) to correspond
to the operation of adding $f(x)$, scaled by the respective
area, at the position of each of the Dirac deltas. Thus, given that the normal
function $g_\sigma(x)$ is intrinsically smooth, it will fill the gaps
between the Dirac deltas, providing an interpolation of
the original density.
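
The property above means that the convolution never needs to be computed as an integral: it reduces to a weighted sum of shifted kernels. The sketch below illustrates this, assuming a hypothetical 7-bin histogram of the eight samples on [0, 7] and a kernel standard deviation of 0.7; the function names are illustrative.

```python
import math

def normal_kernel(x, sigma):
    """Zero-mean normal density with standard deviation sigma."""
    return math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def smooth_deltas(y, positions, weights, sigma):
    """Convolution of sum_i w_i * delta(x - x_i) with the kernel, evaluated at y."""
    return sum(w * normal_kernel(y - x, sigma) for x, w in zip(positions, weights))

# Hypothetical histogram of the eight samples: 7 unit-width bins on [0, 7].
centers = [0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5]
f_R = [0.125, 0.125, 0.125, 0.375, 0.0, 0.0, 0.25]

sigma = 0.7
step = 0.01
grid = [-5.0 + step * k for k in range(1701)]   # covers [-5, 12]
density = [smooth_deltas(y, centers, f_R, sigma) for y in grid]

# Riemann-sum check that the smoothed function keeps (almost exactly) unit area.
area = sum(density) * step
print(f"area = {area:.4f}")
```

Since the kernel has unit area and the weights add up to 1, the smoothed function remains a valid probability density, which the numerical area check confirms up to discretization error.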







Figure 3: (a) The considered sample values represented as a sum of Dirac deltas. (b) The respective parametric estimation (solid line)
compared to the original normal distribution having average $\mu = 3$ and standard deviation $\sigma = \sqrt{2}$. (c) The non-parametric estimations
with a normal kernel with $\mu = 0$ and $\sigma = 0.1; 0.4; 0.7; \ldots; 1.6$. (d) The chosen probability distribution (for $\sigma = 0.7$) shown in solid lines,
while the previously parametrically estimated distribution is shown in dashed lines.


It is also possible to perform kernel-based non-parametric
estimation of probability distributions by convolving
the kernel (a normal distribution in the previous
example) directly with the Dirac deltas associated with each
sample in $S$, therefore avoiding the construction of relative
frequency histograms. This procedure is simpler, but
potentially harder to compute quickly, for instance
by numerically evaluating the convolution in terms
of the fast Fourier transform (e.g. [14]).
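
A minimal sketch of this histogram-free variant follows. It assumes that each sample contributes a Dirac delta of area 1/N, so that the result has unit area; this weighting and the function name `kde` are assumptions made for illustration rather than details stated in the text.

```python
import math

S = [1.88, 2.54, 6.12, 3.14, 3.26, 6.43, 3.92, 0.47]

def kde(y, samples, sigma):
    """Average of normal kernels (standard deviation sigma) centered at each sample."""
    norm = 1.0 / (len(samples) * sigma * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((y - x) / sigma) ** 2) for x in samples)

# Evaluate the estimate where samples are dense (near 3.2) and inside the
# gap between 3.92 and 6.12 (at 5.0), using a kernel with sigma = 0.7.
print(f"estimate at 3.2: {kde(3.2, S, 0.7):.4f}")
print(f"estimate at 5.0: {kde(5.0, S, 0.7):.4f}")
```

Each evaluation costs one kernel term per sample, so evaluating the estimate at many points with many samples becomes expensive, which is the practical motivation for faster schemes.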

Figure 3(c) illustrates the result of the convolution
of the Dirac deltas defined by the original samples,
shown in Figure 3(a), with a normal distribution with
zero average and successive standard deviations $\sigma = 0.1; 0.4; 0.7; \ldots; 1.6$. This was obtained by adding this
normal distribution at each of the Dirac deltas corresponding
to each of the original samples $X_i$.

Observe that the smaller values of $\sigma$ yield
probability distributions that follow more closely the original
sum of Dirac deltas in Figure 3(a), implying respective
gaps along the $X$-axis. Larger values of $\sigma$ will fill these
gaps, yielding a smoother probability distribution that is
closer to the original normal shown by dashed lines in
Figure 3(b). However, too much smoothing will imply a
substantial loss of detail about the original data.
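
One simple way to quantify this trade-off is to count the local maxima of the estimate for each kernel width: many peaks indicate that the estimate still follows the individual samples, while a single peak indicates heavy smoothing. The sketch below does this on a regular grid for the six values of the standard deviation used in Figure 3(c); the grid limits, the step and the helper names are assumptions chosen for illustration.

```python
import math

S = [1.88, 2.54, 6.12, 3.14, 3.26, 6.43, 3.92, 0.47]

def kde(y, samples, sigma):
    """Average of normal kernels (standard deviation sigma) centered at each sample."""
    norm = 1.0 / (len(samples) * sigma * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((y - x) / sigma) ** 2) for x in samples)

def count_modes(samples, sigma, lo=-2.0, hi=10.0, step=0.01):
    """Count strict local maxima of the estimate sampled on a regular grid."""
    n = int(round((hi - lo) / step)) + 1
    values = [kde(lo + k * step, samples, sigma) for k in range(n)]
    return sum(
        1
        for k in range(1, n - 1)
        if values[k] > values[k - 1] and values[k] > values[k + 1]
    )

sigmas = [0.1, 0.4, 0.7, 1.0, 1.3, 1.6]
modes = {s: count_modes(S, s) for s in sigmas}
for s in sigmas:
    print(f"sigma = {s:.1f}: {modes[s]} local maxima")
```

The count is only a rough diagnostic, since it depends on the grid resolution, but it makes the progressive filling of the gaps visible without plotting.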

So, non-parametric estimation often requires some criterion
to be applied in order to choose between the respective
parametric configurations (the choice of $\sigma$ values in
our example). Though it is possible to obtain analytical
expressions providing the best parameter choice with respect
to some imposed criterion (e.g. quadratic error), for
simplicity’s sake here we select the smallest value of $\sigma$ capable
of reasonably filling the gaps and yet not smoothing
the resulting distribution too much, which would imply
missing too much information (details).
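
As an example of such an analytical expression, the sketch below applies the normal-reference rule of thumb (often attributed to Silverman), which gives the kernel standard deviation minimizing the asymptotic mean integrated squared error when the data are assumed to be approximately normal. This is a swapped-in, well-known rule shown for comparison only; it is not the visual criterion adopted in the text, and the variable names are illustrative.

```python
import math

S = [1.88, 2.54, 6.12, 3.14, 3.26, 6.43, 3.92, 0.47]
N = len(S)

# Sample average and standard deviation (N - 1 denominator, as assumed earlier).
mu_hat = sum(S) / N
sigma_hat = math.sqrt(sum((x - mu_hat) ** 2 for x in S) / (N - 1))

# Normal-reference rule: h = (4 / (3 N))**(1/5) * sigma_hat,
# roughly 1.06 * sigma_hat * N**(-1/5).
h = (4.0 / (3.0 * N)) ** 0.2 * sigma_hat

print(f"rule-of-thumb kernel standard deviation: {h:.3f}")
```

For these eight samples the rule suggests a kernel roughly twice as wide as the value of 0.7 chosen by inspection, that is, a smoother estimate that would largely suppress the bump at the right-hand side.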

Figure 3(d) illustrates in solid lines the chosen distribution
(for $\sigma = 0.7$), while the normal distribution defined
by the above discussed parametric estimation is shown in
dashed lines. The smaller bump obtained at the right-hand
side of this distribution is a consequence of the relatively
isolated original samples at $X = 6.12$ and $X = 6.43$. Such
effects are sometimes called random fluctuations, which
can impose some level of structure or artifacts on an otherwise
uniform or smoother distribution as a consequence
of the limited number of samples.

9 Concluding Remarks 

This text has presented in an introductory manner some
of the principal concepts from the probability and statistics
field. Having defined what random experiments are,
it was possible to infer that these correspond to a kind of scientific
model taking into account uncertainties in the respectively
obtained results. As such, in similar fashion
to their deterministic scientific counterparts, statistical models
are also never guaranteed to be fully precise or complete,
because of incompleteness in the representation of real-world
quantities as well as noise and experimental errors.

Yet, statistical models are very important in science
and technology for this very same reason, i.e. they
provide a principled means for coping with the incompleteness
and the level of experimental error that are
unavoidable in practice.

After introducing the concepts of random variables and
probability distribution functions in terms of discrete outcomes
(such as in the dice experiment), we resorted to
the Dirac delta ‘function’ as a means of integrating the
statistical modeling of both discrete and continuous random
variables. Some types of probability distributions
were presented and discussed, and the important concept
of statistical moments was also described and illustrated.
We also briefly discussed the useful concept of random
variable transformations, and presented an overview
of parametric and non-parametric estimation.

It is hoped that the current text can motivate and help
the reader to probe further, not only complementing the
presented concepts, but also delving into other important
topics such as multivariate statistics, decision theory,
principal component analysis, and stochastic processes,
among many other interesting possibilities.
Costa’s Didactic Texts - CDTs


CDTs intend to be a halfway point between 
a formal scientific article and a dissemination 
text in the sense that they: (i) explain and 
illustrate concepts in a more informal, graphical 
and accessible way than the typical scientific 
article; and, at the same time, (ii) provide more 
in-depth mathematical developments than a more 
traditional dissemination work. 

It is hoped that CDTs can also provide integration
and new insights and analogies concerning the
reported concepts and methods. We hope these
characteristics will contribute to making CDTs
interesting both to beginners and to more
senior researchers.

Though CDTs are intended primarily for those
who have some preliminary experience in the
covered concepts, they can also be useful as a
summary of the main topics and concepts to be learnt
by other readers interested in the respective CDT
theme.

Each CDT focuses on a few interrelated concepts.
Though attempting to be relatively self-contained,
CDTs also aim at being relatively short. Links to
related material are provided as a complement
to the covered subjects.

The currently available set of CDTs can be found
at: https://www.researchgate.net/project/Costas-Didactic-Texts-CDTs.